Abstract


No significant body of research examines writing 
achievement and the specific skills and knowledge in 
the writing domain for postsecondary (college) stu- 
dents in the U.S., even though many at-risk students 
lack the prerequisite writing skills required to persist in 
their education. This paper addresses this gap through 
a novel exploratory study examining how automated 
writing evaluation (AWE) can inform our understand- 
ing of the relationship between postsecondary writing 
skill and broader indicators of college success. The ex- 
ploratory study presented in this paper was conducted 
using test-taker essays from a standardized writing as- 
sessment of postsecondary student learning outcomes. 
Findings showed that AWE features from the essays
were predictors of broader outcomes measures:
college success indicators and learning outcomes
measures. Study findings expose AWE’s potential
to support educational analytics (i.e., relationships
between writing skill and broader outcomes), moving
AWE beyond writing assessment and instructional use
cases.


1 Introduction 


Writing is a challenge, especially for at-risk stu- 
dents who may lack the prerequisite writing skills 
required to persist in U.S. 4-year postsecondary 
(college) institutions (NCES, 2012). Educators 
teaching postsecondary courses that require writ- 
ing could benefit from a better understanding of 

writing achievement and its role in postsecondary 
success (college completion). U.S. K-12 research
examines writing achievement and the specific 
skills and knowledge in the writing domain 
(Berninger, Nagy & Beers, 2011; Olinghouse, 
Graham, & Gillespie, 2015). No parallel significant
body of research exists for postsecondary students.
There has been research related to essay
writing on standardized tests and college success
indicators for exams, such as the College Board
Advanced Placement (Bridgeman & Lewis,
1994). However, only the final overall essay score
is evaluated. In this work, we try to drill deeper
into essays to explore whether specific features in
the writing of college students are related to
measures of broader outcomes.

Automated writing evaluation (AWE) systems 
typically support the measurement of pertinent 
writing skills for automated scoring of large-vol- 
ume, high-stakes assessments (Attali & Burstein, 
2006; Shermis et al, 2015) and online instruction 
(Burstein et al, 2004; Foltz et al, 2013; Roscoe et 
al, 2014). AWE has been used primarily for on- 
demand essay writing on standardized assess- 
ments. However, the real-time, dynamic nature of 
NLP-based AWE affords the ability to explore 
linguistic features and skill relationships across a 
range of writing genres in postsecondary education,
such as on-demand essay writing tasks,
argumentative essays from the social sciences, and
lab reports in STEM courses (Burstein et al, 
2016). Such relationships can provide educational 
analytics that could be informative for various 
stakeholders, including students, instructors, par- 
ents, administrators and policy-makers. 

This paper discusses an exploratory secondary 
data analysis, using AWE to examine interactions 
between writing and broader outcomes measures 
of student success. An evaluation was conducted 
using test-taker essays from a standardized writing 
assessment of postsecondary student learning outcomes.
Findings suggested that AWE features
from the essays were predictors of
broader outcomes measures: college success indicators
and learning outcomes measures. Recent
work has shown similar results, examining relationships
between AWE and reading skills (Allen
et al, 2016) versus broader outcomes measures
(discussed here).

Figure 1. Construct representation of the
AWE features extracted from pilot study essays.

The work presented here broadens the lens,
exposing AWE’s potential to inform our
understanding of the relationship between writing and
critical educational outcomes above and beyond 
prevalent use cases for assessment and instruction 
of writing itself. 


2 The Study 


An exploratory secondary data analysis was conducted
to examine relationships between responses
to a 45-minute, timed standardized writing
assessment of postsecondary student learning
and indicators of college success.
The writing assessment contains two components: 
an on-demand essay task requiring students to 
compose an essay in response to a prompt wherein 
they must adopt or defend a position or a claim 
presented in the prompt; and 15 selected-response 
(SR) (multiple choice) items related to one read- 
ing passage. The SR portion measures writing do- 
main knowledge skills, such as English conven- 
tions, vocabulary choice, evaluating evidence, an- 
alyzing arguments, understanding the language of 
argumentation, evaluating organization, distin- 
guishing between valid and invalid arguments, 
and evaluating tone. The writing assessment is 
one of three component skills assessments from 
an outcomes assessment suite. A second component
test, on critical thinking, is also used for this study.
It is also a 45-minute, timed assessment, composed
of 27 or 29 selected-response items depending
on the test form (i.e., version of a test). The
pilot study includes five forms of the critical
thinking test. The five forms were developed
under the same test specification, and their scores
were linked to each other so that they can be used
interchangeably (Liu et al., 2016).

In this study, we examine relationships between 
AWE features found in essay responses of 4-year 
postsecondary students who took the writing as- 
sessment, and indicators of college success. 


2.1 Data 


To evaluate the psychometric properties of the 
assessment and to gather evidence on the reliabil- 
ity and validity of the test prior to its release, the 
authors’ organization had previously conducted 
an extensive pilot test of the assessment at more 
than 33 colleges and universities. Analyses used 
all data collected from 929 students (37% first-year
students, 29% sophomores, 16% juniors, and 18%
seniors) enrolled at the institutions; students had
completed one of two pilot forms of the writing 
assessment. Of the 929 students, 514 also had 
scores from the pilot critical thinking assessment. 

In addition to the writing assessment essay
text, the pilot test data include human ratings for
the essay responses and selected-response item
scores. We also had access to students’ college
GPA and some external measures, such as the
critical thinking assessment scores, SAT or
ACT scores, and high school grade point average
(GPA), although these variables were missing for
subsamples of students.


2.2 Methods


Several hundred AWE features were generated 
for the essay writing data. These features were 
drawn from a large portfolio of features used for 
analysis of student writing (including features 
from a commercial essay scoring engine). As this 
was an initial exploratory analysis, one of the au- 
thors selected an initial, manageable set of 61 con- 
struct-relevant features related to subconstructs, 
including English writing conventions (e.g., er- 
rors in grammar and mechanics), coherence (e.g., 
flow of ideas), organization and development, vo- 
cabulary, and topicality. See Figure 1 (above). 
The author hypothesized that this 61-feature sub- 
set would have strong predictive potential based 
on the subconstruct that each feature was intended 
to address, and its alignment with the writing as- 
sessment construct. 
| Feature Name | Subconstruct Class | NLP-Based Feature / Resource Description |
|---|---|---|
| argumentation | argumentation | Detection of sentences containing argumentation (Beigman Klebanov et al, 2017) |
| dis_coh1 | coherence | Aggregate discourse coherence quality measure (Somasundaran et al, 2014) |
| gen_max_lsa | coherence | Latent semantic analysis values computed for long-distance sentence pairs (Somasundaran et al, 2014) |
| dis_coh2, dis_coh3, dis_coh4 | coherence | Three measures related to topic distribution in a text (Beigman Klebanov et al, 2013; Burstein et al, 2016) |
| fphajnp | collocation | Noun phrase collocations identified using a rank-ratio based collocation detection algorithm trained on the Google Web1T n-gram corpus (Futagi et al, 2008) |
| logdta | discourse | Aggregate value based on length of essay-based discourse element (Attali & Burstein, 2006) derived from a discourse structure detection method that identifies essay-based discourse elements (e.g., thesis statement) (Burstein et al, 2003) |
| grammaticality | English conventions | Aggregate value generated for relative grammaticality (Heilman et al, 2014) |
| logg | English conventions | Aggregate value from a set of 9 automatically-detected grammar error feature types (Attali & Burstein, 2006) |
| nsqm | English conventions | Aggregate value from a set of 12 automatically-detected mechanics error feature types (Attali & Burstein, 2006) |
| nsqu | English conventions | Aggregate value from a set of 10 automatically-detected word usage error feature types (Attali & Burstein, 2006) |
| statives | narrativity | Count measures using a manually-compiled list of stative verbs (i.e., express states vs. action, e.g., feel). |
| PR1, PR2 | personal reflection | Aggregate scores generated related to use of personal reflection language (Beigman Klebanov et al, 2017) |
| complexnp | phrasal complexity | Noun phrases identified with a hyphenated adjective or a prepositional phrase modifier using regular expressions defined on constituency parses. |
| svf | sentence variety | Aggregate value generated based on sentence-type factors (Burstein et al, 2013) |
| topicdev | topic development | Detection of main topics and related words (Beigman Klebanov et al, 2013; Burstein et al, 2016) |
| nwf_median | vocabulary sophistication | Aggregate measure generated related to word frequency (Attali & Burstein, 2006) |
| wordln_2 | vocabulary sophistication | Aggregate measure generated related to average word length for all words in a text (Attali & Burstein, 2006) |
| variants1, variants2 | vocabulary usage | Detection of morphologically complex inflectional (variants1) and derivational (variants2) word forms using an algorithm that first over-generates variants using rules and then filters using co-occurrence statistics computed over Gigaword (Madnani et al, 2016) |
| metaphor | vocabulary usage | Detection of metaphor (Beigman Klebanov et al, 2015; Beigman Klebanov et al, 2016) |
| sentiment | vocabulary usage | Count measures based on VADER sentiment lexicon entries. |
| vocab_richness | vocabulary usage | Aggregate feature composed of a number of text-based vocabulary-related measures (e.g., morphological complexity, relatedness of words in a text). This work is not yet published. |
| colprep | vocabulary usage | Aggregate measure related to collocation and preposition use (described in Burstein et al, 2013). |

Table 1: The 26 Features, Subconstructs & Methods

Before modeling the interactions between the
61 AWE features and other measures, an analysis
was conducted to identify features that were
functionally related or strongly correlated, in order
to remove redundant features. This analysis identified 35
features that were monotonic functions of other
features (e.g., one feature equaled the log of a
second feature), were very highly linearly correlated
with other features, or had very small variance.
Among features that were functionally related or
highly correlated, the feature most highly correlated
with human ratings of the essay was retained. The
outcome of this analysis was the set of 26 features
listed in Table 1. Only the 26 features in this subset were
used for the analysis reported here.
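
The redundancy screen described above can be sketched in a few lines. This is an illustrative sketch only, not the procedure the authors ran: the function name `prune_redundant`, the rank-correlation threshold `r_max`, and the variance floor `var_min` are all assumptions introduced here. Rank correlation is used so that a feature and a monotonic transform of it (e.g., its log) are flagged as redundant.

```python
import numpy as np

def prune_redundant(X, names, human, r_max=0.95, var_min=1e-8):
    """Greedy redundancy filter (hypothetical thresholds r_max, var_min).

    Features are visited in decreasing order of |correlation with the human
    rating|; a feature is skipped if it is near-constant or if its rank
    correlation with an already-kept feature exceeds r_max.
    """
    X = np.asarray(X, dtype=float)
    human = np.asarray(human, dtype=float)
    # Drop near-constant features first.
    candidates = [j for j in range(X.shape[1]) if X[:, j].var() > var_min]
    # Column-wise ranks: monotonic transforms of a feature share its ranks.
    order = np.argsort(X, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable").astype(float)
    # Strength of each candidate's linear relation to the human rating.
    strength = {j: abs(np.corrcoef(X[:, j], human)[0, 1]) for j in candidates}
    kept = []
    for j in sorted(candidates, key=strength.get, reverse=True):
        redundant = any(
            abs(np.corrcoef(ranks[:, j], ranks[:, k])[0, 1]) > r_max
            for k in kept
        )
        if not redundant:
            kept.append(j)
    return [names[j] for j in sorted(kept)]
```

Because the strongest correlate of the human rating is visited first, it is the one retained from each redundant group, which mirrors the retention rule stated in the text.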

The analysis consisted of linear regression analyses
with the AWE features as the independent (or
predictor) variables and scores on the critical
thinking assessment, SAT or ACT, writing assessment
selected-response (SR) items, and college
GPA as the dependent variables. Separate regression
analyses were conducted for each dependent
variable. For example, there was a model
predicting GPA as a function of argumentation,
another model predicting GPA as a function of
dis_coh1, another model predicting GPA as a
function of gen_max_lsa, and so on for each of the
features. This modeling process was repeated for
each of the dependent variables. The goal of the
analysis was to determine how strongly each feature
was related to each outcome. However, since
better writers will probably get better scores on
other tests too, we wanted to know if the features
contained unique information for predicting the
dependent variables, above and beyond how well
the essay was written. That is, we wanted to know
if two students who appear to be comparable writers
based on human scores can be further differentiated
by the additional properties of their writing
as captured by AWE. Therefore, for each dependent
variable, a series of regression models
was fit that predicted the dependent variable not
only as a function of each of the feature values,
but also as a function of the length of the essay and the
average of the human ratings on a 6-point scale
(where 1 indicates the lowest proficiency and 6,
the highest). The regression models included
these two additional predictors because both are
related to the quality of the essay. Essay length is
generally a good predictor of human ratings of essays
and is related to many AWE features (Chodorow
& Burstein, 2004). By including these two
additional predictors in the model, we were better
able to isolate the relationship between the features
and the dependent variable distinct from
quality of the essay.
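
A minimal sketch of one such per-feature model follows, assuming ordinary least squares on z-scored variables and the square root of the word count as the length control. The helper names (`zscore`, `r_squared`, `feature_model`) are hypothetical and the standard error of the coefficient is omitted for brevity; this is not the authors' analysis code.

```python
import numpy as np

def zscore(v):
    """Standardize a variable to mean 0 and standard deviation 1."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def r_squared(X, y):
    """Fit y on X by OLS (intercept added here); return (R^2, coefficients)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid.var() / y.var(), beta

def feature_model(feature, n_words, human, outcome):
    """One AWE feature plus two controls: sqrt(essay length), mean human rating."""
    y = zscore(outcome)
    controls = np.column_stack([zscore(np.sqrt(n_words)), zscore(human)])
    full = np.column_stack([zscore(feature), controls])
    r2_base, _ = r_squared(controls, y)   # baseline: controls only
    r2_full, beta = r_squared(full, y)    # feature + controls
    return {
        "coef": beta[1],                  # standardized coefficient of the feature
        "r2": r2_full,
        "inc_r2": r2_full - r2_base,      # variance explained beyond the controls
    }
```

Calling `feature_model` once per feature and per dependent variable would reproduce the structure of the analysis: the incremental R² is the gain of the three-predictor model over the two-control baseline.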


3 Results


Tables 2 to 8 (below) present the results of the
regression analyses for each of the 6 outcomes. For
presentation purposes, the table for each dependent
variable includes only those features where the
coefficient for that feature was significantly
different from zero with a p-value less than 0.05.
Across all the dependent variables, 25 of the 26
features appear in the table for one or more
dependent variables. Only one feature, metaphor,
did not emerge from the analyses. Given that 26
features were tested for each dependent variable,
there is a considerable chance that some p-values
below 0.05 arose by chance and did not
reflect a true relationship.
Controlling for multiple comparisons would be
required to reduce the probability of spurious
p-values of less than 0.05. Here, p-values were used to
reduce the size of the tables and focus on features
with the strongest evidence of a relationship with
each dependent variable.
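
To illustrate what controlling for multiple comparisons could look like, the sketch below applies two standard corrections to a vector of p-values: the Bonferroni correction and the Benjamini-Hochberg false discovery rate procedure. Neither was applied in this study; the choice of these two methods and the function names are assumptions made for illustration.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / len(p)

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # The i-th smallest p-value is compared with alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        reject[order[: k + 1]] = True      # reject everything up to that rank
    return reject
```

With 26 features tested per dependent variable, the Bonferroni threshold would be 0.05 / 26, or roughly 0.0019, which is considerably stricter than the 0.05 cutoff used for the tables.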

Each row contains a standardized coefficient
from a model that included three predictors: (1) the
AWE feature, (2) the square root of the number of
words (length), and (3) the raw average of 2-3 human
ratings per essay. In addition to the coefficient
for the AWE feature and its standard error,
the table includes the overall R-squared (R²) for
the three independent variables (AWE feature,
length, and average human rating) and the part of
the R-squared attributable to the AWE feature
(Inc. R²). R² measures the proportion of variance
in the dependent variable explained by the predictors.

All features in the tables explain some amount
of variance, showing promise of relationships
between AWE features and college success and
learning outcomes. Results show that for all outcomes,
a breadth of features emerges, covering the
English conventions, coherence or argumentation,
and vocabulary subconstructs. Features
shown in italics in Tables 2-8 indicate relatively
stronger predictors (i.e., greater explained variance),
using Inc. R² of 0.05 as a “cutoff”. Vocabulary
sophistication (“wordln_2”) and vocabulary
usage (“vocab_richness”) were the stronger predictors
of the critical thinking assessment scores,
the SAT/ACT Composite Score and SAT Verbal
Score. Vocabulary usage (“sentiment”)
was a stronger predictor in ACT Science.


4 Discussion and Future Work


This exploratory, secondary data analysis illus- 
trates that 1) writing can provide meaningful in- 
formation about student knowledge related to 
broader outcomes (college success indicators and 
learning outcomes measures) and 2) AWE has 
greater potential for educational analytics above 
and beyond current prevalent uses for writing as- 
sessment and instruction. Vocabulary features 
were the most consistent and strongest predictors. 
This is not surprising since most of the college 
success predictors used in this study involved in- 
tensive reading, and vocabulary knowledge is 
shown to be related to reading comprehension 
(Qian & Schedl, 2004; Quinn et al, 2015). The 
detailed analyses illustrated in Tables 2-8 do
show statistically significant relationships be- 
tween the full set of writing skill feature measures 
and broader outcomes. The big picture is that this 
line of research could inform instructional curric- 
ulum, assessment development, and educational 
policy vis-a-vis the improvement of college stu- 
dent success factors. 

The limitations of this project are the small size
of the data set, since students were missing some of
the dependent variables, and the examination of
writing data from a single writing genre (i.e.,
on-demand essay writing). However, these will be
addressed in next steps, in Fall 2017-Spring 2018.
The authors will conduct a larger study with seven
4-year postsecondary partner institutions. A larger
sample of student writing will be collected from
approximately 2,000 students from the sites. Student
writing data collected will include not only
on-demand essay writing; students will each also
provide multiple authentic writing assignments from
their courses. Both writing and disciplinary courses
will be included in the study. Student success factor
data, such as SAT and ACT scores, college GPA,
course grades, and course completion, will also be
collected. We will administer the same writing
assessment and critical thinking assessment as
outcomes measures. Using the new data, we will
apply knowledge from this study to continue to
evaluate how AWE can provide analytics related to
broader outcomes measures. Further, this larger
data set will span different genres, which will afford
the opportunity to 1) replicate this exploratory
study on the same writing assessment as a baseline,
and 2) apply current and enhanced analyses to
authentic writing data collected from college
students.

AWE has traditionally been used for writing 
assessment (automated essay scoring), and writ- 
ing instruction (automated feedback about writ- 
ing). The work presented in this paper explores 
new territory and brings awareness to the potential
impact of NLP in a bigger educational space,
i.e., to support understanding of relationships
between writing and broader outcomes of student
success.


| Variable | Coefficient | Std. Error | R² | Inc. R² |
|---|---|---|---|---|
| logg | 0.10 | 0.04 | 0.22 | 0.01 |
| nsqu | 0.17 | 0.04 | 0.24 | 0.02 |
| nsqm | 0.11 | 0.04 | 0.22 | 0.01 |
| svf | 0.27 | 0.06 | 0.25 | 0.03 |
| nwf_median | 0.18 | 0.04 | 0.24 | 0.03 |
| wordln_2 | 0.25 | 0.04 | 0.27 | 0.06 |
| PR1 | -0.08 | 0.04 | 0.22 | 0.01 |
| fphajnp | 0.08 | 0.04 | 0.22 | 0.01 |
| complexnp | 0.12 | 0.04 | 0.23 | 0.01 |
| variants1 | 0.23 | 0.04 | 0.26 | 0.04 |
| vocab_richness | 0.27 | 0.05 | 0.26 | 0.05 |
| dis_coh1 | 0.40 | 0.13 | 0.23 | 0.01 |
| sentiment | 0.15 | 0.04 | 0.23 | 0.02 |

Table 2: Critical Thinking Composite Score; Baseline R² with human rating and length = 0.21


| Variable | Coefficient | Std. Error | R² | Inc. R² |
|---|---|---|---|---|
| nsqu | 0.12 | 0.03 | 0.23 | 0.01 |
| nsqm | 0.21 | 0.03 | 0.25 | 0.04 |
| svf | 0.11 | 0.04 | 0.22 | 0.01 |
| wordln_2 | 0.19 | 0.03 | 0.24 | 0.03 |
| grammaticality | 0.12 | 0.03 | 0.22 | 0.01 |
| colprep | 0.08 | 0.03 | 0.22 | 0.01 |
| dis_coh3 | -0.10 | 0.03 | 0.22 | 0.01 |
| dis_coh4 | -0.11 | 0.05 | 0.22 | 0.00 |
| fphajnp | 0.11 | 0.03 | 0.22 | 0.01 |
| complexnp | 0.08 | 0.03 | 0.22 | 0.01 |
| variants2 | 0.13 | 0.03 | 0.22 | 0.01 |
| vocab_richness | 0.13 | 0.03 | 0.22 | 0.01 |
| dis_coh1 | 0.23 | 0.09 | 0.22 | 0.01 |
| sentiment | 0.06 | 0.03 | 0.22 | 0.00 |
| statives | -0.13 | 0.03 | 0.23 | 0.02 |

Table 3: Writing Assessment Selected Response Score; Baseline R² with human rating and length = 0.21


| Variable | Coefficient | Std. Error | R² | Inc. R² |
|---|---|---|---|---|
| logg | 0.09 | 0.04 | 0.17 | 0.01 |
| nsqu | 0.10 | 0.04 | 0.17 | 0.01 |
| nsqm | 0.17 | 0.04 | 0.18 | 0.03 |
| svf | 0.25 | 0.05 | 0.19 | 0.03 |
| nwf_median | 0.14 | 0.04 | 0.18 | 0.02 |
| wordln_2 | 0.25 | 0.04 | 0.21 | 0.06 |
| grammaticality | 0.08 | 0.04 | 0.16 | 0.01 |
| colprep | 0.10 | 0.04 | 0.17 | 0.01 |
| PR1 | -0.12 | 0.04 | 0.17 | 0.01 |
| PR2 | -0.12 | 0.04 | 0.17 | 0.01 |
| fphajnp | 0.13 | 0.04 | 0.18 | 0.02 |
| complexnp | 0.12 | 0.04 | 0.17 | 0.01 |
| variants2 | 0.20 | 0.04 | 0.19 | 0.03 |
| gen_max_lsa | -0.12 | 0.06 | 0.16 | 0.01 |
| vocab_richness | 0.31 | 0.04 | 0.22 | 0.06 |
| dis_coh1 | 0.26 | 0.12 | 0.16 | 0.01 |
| sentiment | 0.17 | 0.04 | 0.19 | 0.03 |

Table 4: SAT/ACT Composite Score (ACT rescaled to the SAT Scale); Baseline R² with human rating and length = 0.16


| Variable | Coefficient | Std. Error | R² | Inc. R² |
|---|---|---|---|---|
| logg | 0.11 | 0.04 | 0.18 | 0.01 |
| nsqu | 0.14 | 0.04 | 0.18 | 0.02 |
| nsqm | 0.15 | 0.04 | 0.18 | 0.02 |
| svf | 0.29 | 0.06 | 0.21 | 0.04 |
| nwf_median | 0.15 | 0.04 | 0.19 | 0.02 |
| wordln_2 | 0.29 | 0.04 | 0.24 | 0.07 |
| grammaticality | 0.11 | 0.05 | 0.17 | 0.01 |
| colprep | 0.12 | 0.05 | 0.18 | 0.01 |
| argumentation | 0.13 | 0.05 | 0.18 | 0.01 |
| PR1 | -0.15 | 0.04 | 0.19 | 0.02 |
| PR2 | -0.12 | 0.05 | 0.18 | 0.01 |
| fphajnp | 0.11 | 0.05 | 0.17 | 0.01 |
| complexnp | 0.12 | 0.05 | 0.18 | 0.01 |
| variants1 | 0.13 | 0.05 | 0.18 | 0.01 |
| variants2 | 0.22 | 0.05 | 0.20 | 0.04 |
| gen_max_lsa | -0.13 | 0.06 | 0.17 | 0.01 |
| vocab_richness | 0.33 | 0.05 | 0.23 | 0.07 |
| dis_coh1 | 0.28 | 0.13 | 0.17 | 0.01 |
| sentiment | 0.12 | 0.04 | 0.18 | 0.01 |

Table 5. SAT Verbal Score; Baseline R² with human rating and length = 0.16


| Variable | Coefficient | Std. Error | R² | Inc. R² |
|---|---|---|---|---|
| nsqm | 0.22 | 0.05 | 0.14 | 0.04 |
| svf | 0.19 | 0.06 | 0.12 | 0.02 |
| nwf_median | 0.14 | 0.05 | 0.12 | 0.02 |
| wordln_2 | 0.20 | 0.05 | 0.14 | 0.03 |
| colprep | 0.10 | 0.05 | 0.11 | 0.01 |
| PR1 | -0.12 | 0.05 | 0.12 | 0.01 |
| PR2 | -0.13 | 0.05 | 0.11 | 0.01 |
| fphajnp | 0.10 | 0.05 | 0.11 | 0.01 |
| complexnp | 0.11 | 0.05 | 0.11 | 0.01 |
| variants2 | 0.15 | 0.05 | 0.12 | 0.02 |
| gen_max_lsa | -0.16 | 0.07 | 0.11 | 0.01 |
| vocab_richness | 0.24 | 0.05 | 0.14 | 0.04 |
| sentiment | 0.18 | 0.04 | 0.13 | 0.03 |

Table 6. SAT Math Score; Baseline R² with human rating and length = 0.10


ACT English

| Variable | Coefficient | Std. Error | R² | Inc. R² |
|---|---|---|---|---|
| nsqu | 0.11 | 0.05 | 0.16 | 0.01 |
| nsqm | 0.15 | 0.05 | 0.17 | 0.02 |
| logdta | -0.19 | 0.06 | 0.18 | 0.03 |
| [illegible] | 0.17 | 0.07 | 0.17 | 0.02 |
| wordln_2 | 0.16 | 0.06 | 0.18 | 0.02 |
| dis_coh2 | 0.21 | 0.11 | 0.16 | 0.01 |
| argumentation | 0.16 | 0.07 | 0.17 | 0.01 |
| vocab_richness | 0.16 | 0.06 | 0.17 | 0.02 |
| sentiment | 0.24 | 0.07 | 0.19 | 0.03 |

ACT Math

| Variable | Coefficient | Std. Error | R² | Inc. R² |
|---|---|---|---|---|
| svf | 0.18 | 0.07 | 0.12 | 0.02 |
| wordln_2 | [illegible] | [illegible] | [illegible] | [illegible] |
| complexnp | 0.16 | 0.06 | 0.13 | 0.02 |
| variants1 | 0.15 | 0.06 | 0.12 | 0.02 |
| variants2 | [illegible] | [illegible] | [illegible] | [illegible] |
| vocab_richness | 0.21 | 0.07 | 0.13 | 0.03 |
| dis_coh1 | 0.38 | 0.17 | 0.12 | 0.02 |
| sentiment | 0.19 | 0.06 | 0.14 | 0.03 |

ACT Science

| Variable | Coefficient | Std. Error | R² | Inc. R² |
|---|---|---|---|---|
| variants1 | 0.17 | 0.06 | 0.10 | 0.02 |
| vocab_richness | 0.26 | 0.07 | 0.12 | 0.04 |
| sentiment | 0.23 | 0.06 | 0.12 | 0.05 |

Table 7. ACT Subject Test Scores; Baseline R² with human rating and length: ACT English = 0.15; ACT Math = 0.11; ACT Reading = 0.13; ACT Science = 0.08


| Variable | Coefficient | Std. Error | R² | Inc. R² |
|---|---|---|---|---|
| [illegible] | [illegible] | [illegible] | 0.07 | 0.02 |
| wordln_2 | 0.13 | 0.05 | 0.06 | 0.02 |
| grammaticality | 0.13 | 0.05 | 0.06 | 0.01 |
| argumentation | 0.13 | 0.06 | 0.05 | 0.01 |
| topicdev | 0.10 | 0.05 | 0.05 | 0.01 |
| vocab_richness | 0.12 | 0.05 | 0.05 | 0.01 |

Table 8. Cumulative GPA; Baseline R² with human rating and length = 0.04


Acknowledgements